A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.
Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
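The abstract's remark that paraphrasing can be seen as bidirectional textual entailment can be illustrated with a minimal sketch. The entailment check below is a toy lexical-coverage heuristic chosen only for illustration; it is an assumption, not a method from the survey, and the names `entails`, `is_paraphrase`, and the `threshold` parameter are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def entails(premise, hypothesis, threshold=0.8):
    """Toy recognizer (assumption): the hypothesis counts as entailed when
    at least `threshold` of its tokens also occur in the premise."""
    hyp = tokens(hypothesis)
    if not hyp:
        return True  # an empty hypothesis is trivially covered
    coverage = len(hyp & tokens(premise)) / len(hyp)
    return coverage >= threshold


def is_paraphrase(a, b, threshold=0.8):
    """Paraphrase as bidirectional entailment: a entails b AND b entails a."""
    return entails(a, b, threshold) and entails(b, a, threshold)


if __name__ == "__main__":
    long_text = "A dog is running in the park"
    short_text = "A dog is running"
    print(entails(long_text, short_text))        # one direction holds
    print(entails(short_text, long_text))        # the reverse does not
    print(is_paraphrase(long_text, short_text))  # so not a paraphrase
```

Because the heuristic ignores word order and meaning, it only shows the structure of the definition (two one-directional checks combined); a real recognizer would replace `entails` with a trained model.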
Deep Learning for User Comment Moderation
Experimenting with a new dataset of 1.6M user comments from a Greek news
portal and existing datasets of English Wikipedia comments, we show that an RNN
outperforms the previous state of the art in moderation. A deep,
classification-specific attention mechanism further improves the overall
performance of the RNN. We also compare against a CNN and a word-list baseline,
considering both fully automatic and semi-automatic moderation.
EDGAR-CORPUS: Billions of Tokens Make The World Go Round
We release EDGAR-CORPUS, a novel corpus comprising annual reports from all
the publicly traded companies in the US spanning a period of more than 25
years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP
corpus available to date. All the reports are downloaded, split into their
corresponding items (sections), and provided in a clean, easy-to-use JSON
format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC
embeddings for the financial domain. We employ these embeddings in a battery of
financial NLP tasks and showcase their superiority over generic GloVe
embeddings and other existing financial word embeddings. We also open-source
EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future
annual reports.
Comment: 6 pages, short paper at ECONLP 2021 Workshop, in conjunction with
EMNLP 202
Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking
Standard Full-Data classifiers in NLP demand thousands of labeled examples,
which is impractical in data-limited domains. Few-shot methods offer an
alternative, utilizing contrastive learning techniques that can be effective
with as few as 20 examples per class. Similarly, Large Language Models
(LLMs) like GPT-4 can perform effectively with just 1-5 examples per class.
However, the performance-cost trade-offs of these methods remain underexplored,
a critical concern for budget-limited organizations. Our work addresses this
gap by studying the aforementioned approaches over the Banking77 financial
intent detection dataset, including the evaluation of cutting-edge LLMs by
OpenAI, Cohere, and Anthropic in a comprehensive set of few-shot scenarios. We
complete the picture with two additional methods: first, a cost-effective
querying method for LLMs based on retrieval-augmented generation (RAG), able to
reduce operational costs severalfold compared to classic few-shot
approaches, and second, a data augmentation method using GPT-4, able to improve
performance in data-limited scenarios. Finally, to inspire future research, we
provide a human expert's curated subset of Banking77, along with extensive
error analysis.
Comment: Long paper accepted to ACM ICAIF-2
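The idea behind retrieval-based querying, as the abstract describes it, is to send the LLM only the few labeled examples most relevant to each query rather than a fixed, long few-shot prompt. The following is a minimal sketch under stated assumptions, not the paper's implementation: Jaccard token overlap stands in for whatever retriever the authors use, no LLM is called, and the example pool, the intent labels, and the function names are all hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity (a stand-in for embedding similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query, pool, k=2):
    """Return the k labeled examples from pool most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(query, pool, k=2):
    """Build a few-shot prompt containing only the k retrieved examples."""
    parts = ["Classify the banking query into one intent label."]
    for text, label in retrieve(query, pool, k):
        parts.append(f"Query: {text}\nIntent: {label}")
    parts.append(f"Query: {query}\nIntent:")
    return "\n\n".join(parts)


# Hypothetical labeled pool; texts and labels are illustrative only.
POOL = [
    ("My new card has not arrived yet", "card_arrival"),
    ("How do I change my PIN", "change_pin"),
    ("I was charged twice for one purchase", "duplicate_charge"),
    ("When will my card arrive", "card_arrival"),
]

if __name__ == "__main__":
    print(build_prompt("Has my card arrived", POOL, k=2))
```

The prompt length here grows with `k` rather than with the number of classes, which is where the cost saving would come from; the resulting string would then be passed to whichever LLM API is in use.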